Indexing methods
EMu uses two basic indexing methods:
The linear hashing method is used to provide key based lookup. A key is a unique value used to identify a record. In EMu this is the IRN (Internal Record Number). Not only must key searching be fast, it must also enforce uniqueness. It must not be possible to have two records with the same key (IRN). The linear hashing algorithm ensures that a record can be located via the IRN in an average of 1.1 disk accesses, which is very efficient. It also ensures key uniqueness.
Many institutions have numbering sequences that uniquely identify an object or event. In some instances the number is almost always unique apart from a few exceptions. EMu allows some fields to be designated as unique, ensuring that when a record is saved the value in the field has not been entered before. Sometimes this check may need to be relaxed to allow "duplicate" numbers to be saved. The unique check performed on these fields also uses the linear hash method.
While the Two Level method dictates how a search should be executed internally, it is a flexible method that supports a wide number of searching variations within the one framework. For example, EMu provides word based searching for all fields. Simply type in one or more words in any order into the field to be searched, and matches will be quickly retrieved. EMu also supports phonetic based searching via the @
operator. If a word is preceded with \@, all words that sound like the word entered will be found. Phonetic searching uses the same indexing as word based searching even though there are different results. The difference is in the definition of a term:
- For word based searching a term is a word.
- For phonetic searching a term is a series of numbers that represents the sound of the word (using a complex algorithm that converts a word into its basic sound groups).
Indexing methods introduced with EMu 3.1 increased support for wildcard searches:
- Null based indexing provides very fast retrieval when searching for fields that are empty / not empty.
- partial based indexing offers full index support for wildcard queries that specify leading letters.
Both these methods use the Two Level scheme. They simply provide different terms based on the type of indexing required.
The basic searching provided in EMu is word based. A term in word based searching is a sequence of alphabetic and numeric characters up to a word break character. A word break character consists of all punctuation characters except for apostrophes (') and underscores (_), and all space characters. For example the string:
Relax. You know you're in safe hands.
consists of the words:
relax
you
know
youre
in
safe
hands
Notice how the punctuation is removed and the case is converted to lower case (searching is case insensitive). The words above are the search terms that can be used to locate the sample string. When word based indexing is used, extra terms are generated for word pairs (two consecutive words within the same sentence). For the above sample the word pairs generated are:
relax you
you know
know youre
youre in
in safe
safe hands
The word pairs provide direct support for phrase based searching, that is a search where the words are enclosed in double quotes (e.g. "in safe hands"). Enabling word based searching automatically provides phrase based searching.
In many instances it may be useful to search for variations on a base word regardless of the tense or form of the word used. Consider a search for the word election:
- A word based search will return all occurrences of the word election, but it will not find elect, elected, elector, electing, etc.
- A stem based search, which is specified by preceding a word with a tilde (
~
), will find all variations of the word.
In stem based indexing the basic term is the root of the word (in this case elect). Entering any variation of the word preceded by \~ will result in a search for the root word (e.g. searching for \~elected will result in the search term elect). The algorithm to convert a word to its root word is a complex one that has been refined over many years. It is not a simple truncation mechanism, otherwise \~elect would match electric. The rules used to determine the root word are effective for English text only.
Stem based indexing is not enabled on many fields in EMu. If it is not enabled, stem based searches are still possible, however the exhaustive search method is used. In general, fields that contain descriptive text or notes based fields are good candidates for stem indexing.
When searching for names or scientific terms it is often useful to be able to search for records that contain words that sound like the search term. phonetic based indexing provides this functionality. The basic term for phonetic searching is a number sequence. Each word is converted into a number sequence that encodes the basic sound groups that make up the word. The number sequence is then used for the search and words that generate the same number sequence (and hence sound the same) will be matched.
A phonetic search is specified using the @
character before a word, e.g. \@term.
The algorithm used to convert a word into its basic sound group is very complex as letters can take on different sounds depending on the letters that surround them. A number of refinements to the basic algorithm have been made over many years.
Phonetic based indexing is not enabled on many fields in EMu. It is enabled on fields that typically contain names (e.g. first name and surname fields). If phonetic indexing is not enabled, phonetic based searching is still available, however the exhaustive search method is used.
Occasionally you may want a search to return records where only the exact value entered in a field is matched. String based indexing uses the complete contents of the field as the term for retrieval and only records that exactly match the search terms will be returned. For example, two records have either one or the other of the following values in a suburb field:
Melbourne
North Melbourne
A word based search using Melbourne
as the search term will return both records (as they both contain the word Melbourne). A String based search, using Melbourne will only return the first record as only one record contains Melbourne and nothing else.
String based searching is not used widely in EMu (as word based searching provides a more flexible and useful searching mechanism). It is enabled for the security fields (SecCanDisplay_tab, SecCanEdit_tab and SecCanDelete_tab) used by Record Level Security. Since Record Level Security uses group names and user names it is useful to use String based searching to avoid mismatches on names (e.g. on a group called Herpetology and another called Herpetology Managers).
Some forms of data may be searched by locating all records between start and end points, otherwise known as a range search. EMu provides range based searching for a number of data types: latitude, longitude, date, time and numeric (both integer and floating point) types. Range based searches are specified using the following relational operators:
- Less than (<)
- Less than or equal to (<=)
- Greater than (>)
- Greater than or equal to (>=)
To search for all records inserted between 1 January 2006 and 28 February 2006, in the Date Inserted field you would specify:
>="1 January 2006" <="28 February 2006"
Note the use (and position) of the double quotes as the dates contain spaces.
Range based searching is a variation on the standard Two Level scheme used by the other indexing methods. Essentially it divides the possible values for a range based field into a series of discrete intervals. Then, for each field, value terms are generated to indicate into which intervals the value falls. Using these terms it is possible to provide efficient range based searches.
Sometimes you may only be interested in whether a field contains a value or not. For example, to determine whether records have been assigned to a department, you would use the wildcard search !*
in the Department field (SecDepartment_tab).
NULL based searching provides an additional term that indicates whether the field contains a value or not. This term is used to provide high speed retrieval when certain wildcard searches are used:
\!\*
finds records where a field does not contain a value\*
finds records where a field contains a value
Null based searching can also provide support for searches used to determine whether an attachment field is empty or not:
\!\+
for records that do not have any attachments\+
for records that do have one or more attachments
Most wildcard based searches (i.e. searches looking for patterns or part of a word) specify leading letters. For example, you may want to find all people with a surname starting with Al*
. Since the standard EMu indexing is word based, wildcard based searches must use the exhaustive search method to locate matches. Such a search may take some time for very large numbers of records.
Partial based indexing provides a mechanism for providing very efficient wildcard searching where leading letters are specified (that is where one or more letters or digits appear at the start of the pattern). The method takes each word in a field and generates a term based on the number of leading letters indexed. Part of the partial based indexing configuration is the number of letters to be used to generate partial terms. For example, a field may be configured to provide partial indexing for the first 1,3,5
letters of each word. If we take the sentence:
Relax. You know you're in safe hands.
the following terms are generated for partial based indexing:
r
|
Thus if we searched for rel\* the three letter terms can be searched to locate the matching records. The searching mechanism makes use of any indexing it can; for example, if you search for re\* the single letter terms are used to retrieve a set of potential matches, which are then checked to see if the second letter matches. Partial based indexing provides the same level of response for wildcard based searches where leading letters are supplied as does word based searching.
Partial indexing is enabled on a small number of fields, typically fields containing names.
Solr indexing
A third indexing method, available with EMu 8.0, allows use of Apache Solr for searching rather than the default Texpress based indexing. Advantages of Solr indexing over Texpress indexing include:
- significantly lower disk usage;
- fast range-based searching for numeric, date, time, latitude and longitude values;
- fast wildcard (pattern) searching without the need for partial indexes;
- elimination of false matches (the number of matching records is always correct);
- no need for nightly / weekly maintenance to rebuild indexes;
- no configured limit on the number of records in a module; and
- no bit-slicing phase when a re-index is performed.
Details here.